lectures.alex.balgavy.eu

Lecture notes from university.
git clone git://git.alex.balgavy.eu/lectures.alex.balgavy.eu.git

Reinforcement learning.html (10315B)


				<!DOCTYPE html>
				<html>
					<head>
						<meta charset="UTF-8">
						<link rel="stylesheet" href="pluginAssets/highlight.js/atom-one-light.css">
						<title>Reinforcement learning</title>
					<link rel="stylesheet" href="pluginAssets/katex/katex.css" /><link rel="stylesheet" href="./style.css" /></head>
					<body>

<div id="rendered-md"><h1 id="reinforcement-learning">Reinforcement learning</h1>
<nav class="table-of-contents"><ul><li><a href="#reinforcement-learning">Reinforcement learning</a><ul><li><a href="#what-is-reinforcement-learning">What is reinforcement learning?</a></li><li><a href="#approaches">Approaches</a><ul><li><a href="#random-search">Random search</a></li><li><a href="#policy-gradient">Policy gradient</a></li><li><a href="#q-learning">Q-learning</a></li></ul></li><li><a href="#alpha-stuff">Alpha-stuff</a><ul><li><a href="#alphago">AlphaGo</a></li><li><a href="#alphazero">AlphaZero</a></li><li><a href="#alphastar">AlphaStar</a></li></ul></li></ul></li></ul></nav><h2 id="what-is-reinforcement-learning">What is reinforcement learning?</h2>
<p>The agent is in a state and takes an action.<br>
The action is selected by a policy: a function from states to actions.<br>
The environment tells the agent its new state, and provides a reward (a number; higher is better).<br>
The learner adapts the policy to maximise the expectation of future rewards.</p>
<p>Markov decision process: the optimal policy doesn't need to depend on previous states; only the information in the current state counts.</p>
<p><img src="_resources/e78427ef0d0845d0ae21e1c7857d2740.png" alt="90955f3da8fb0d61c2fa9f3033c65098.png"></p>
<p>Ways to deal with sparse loss:</p>
<ul>
<li>start with imitation learning - supervised learning, copying human actions</li>
<li>reward shaping - guessing a reward for intermediate states, or for states close to good states</li>
<li>auxiliary goals - e.g. curiosity, maximum distance traveled</li>
</ul>
<p>Policy network: a neural net that takes the state as input and outputs an action, with a softmax output layer to produce a probability distribution over actions.</p>
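<p>A rough sketch of such a policy network in plain numpy (a single linear layer plus softmax; the sizes and names are made up for illustration, not taken from the lecture):</p>
<pre class="hljs"><code># Minimal policy network: state features in, probability distribution over actions out.
import numpy as np

n_features, n_actions = 4, 3                     # toy sizes (made up)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(n_features, n_actions))   # the weights we want to learn
b = np.zeros(n_actions)

def softmax(z):
    z = z - z.max()                              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def policy(state):
    """Map a state (feature vector) to a probability distribution over actions."""
    return softmax(state @ W + b)

print(policy(rng.normal(size=n_features)))       # three probabilities that sum to 1
</code></pre>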
<p>Three problems of RL:</p>
<ul>
<li>non-differentiable loss</li>
<li>balancing exploration and exploitation
<ul>
<li>this is a classic trade-off in online learning</li>
<li>for example, an agent in a maze may learn to reach a reward of 1 that's close by and keep exploiting that reward, so it never explores further and never reaches the reward of 100 at the end of the maze</li>
</ul>
</li>
<li>delayed reward/sparse loss
<ul>
<li>you might take an action that causes a negative result, but the result won't show up until some time later</li>
<li>for example, starting to study before an exam is a good thing; the issue is that you only started one day before, and did nothing during the two weeks before that</li>
<li>credit assignment problem: how do you know which action takes the credit (or blame) for the bad result? (see the discounted-return sketch after this list)</li>
</ul>
</li>
</ul>
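<p>The notes don't cover it, but the usual tool for spreading a delayed reward back over earlier steps is the discounted return \(G_t = r_t + \gamma G_{t+1}\): every step gets credited with the immediate reward plus a discounted share of everything that comes later. A small sketch with a made-up reward sequence and discount factor:</p>
<pre class="hljs"><code># Discounted return, computed backwards over one episode.
rewards = [0.0, 0.0, 0.0, 1.0]   # made-up episode: the reward only shows up at the last step
gamma = 0.9                      # made-up discount factor

returns = []
g = 0.0
for r in reversed(rewards):
    g = r + gamma * g
    returns.append(g)
returns.reverse()

print(returns)   # [0.729, 0.81, 0.9, 1.0] - earlier steps get a discounted share of the credit
</code></pre>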
<p>Deterministic policy - every state is always followed by the same action.<br>
Probabilistic policy - all actions are possible, but certain actions have a higher probability.</p>
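<p>The difference in one line each, assuming we already have a probability distribution over actions from some policy network (the numbers are made up):</p>
<pre class="hljs"><code># Deterministic vs probabilistic action selection.
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.7, 0.2, 0.1])   # made-up policy output for some state

deterministic_action = int(np.argmax(probs))                 # always the same action for this state
probabilistic_action = int(rng.choice(len(probs), p=probs))  # any action possible, weighted by probs
</code></pre>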
<h2 id="approaches">Approaches</h2>
<p>How do you choose the weights (i.e. how do you learn)?<br>
Simple backpropagation doesn't work - we don't have labeled examples telling us which move to take in a given state.</p>
<h3 id="random-search">Random search</h3>
<p>Pick a random point m in model space.</p>
<pre class="hljs"><code>loop:
    pick random point m' close to m
    if loss(m') &lt; loss(m):
        m &lt;- m'
</code></pre>
<p>&quot;Close to&quot; means sampled uniformly among all points at some pre-chosen distance r from m.</p>
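<p>A runnable sketch of the same loop on a made-up loss (the loss function, dimensionality, number of iterations and radius r are all invented for illustration):</p>
<pre class="hljs"><code># Random search: propose a nearby point, keep it if the loss improves.
import numpy as np

rng = np.random.default_rng(0)

def loss(m):
    return np.sum((m - np.array([3.0, -2.0])) ** 2)   # made-up loss: distance to a fixed target

m = rng.normal(size=2)    # random starting point in model space
r = 0.1                   # pre-chosen distance for "close to"

for _ in range(5000):
    step = rng.normal(size=2)
    m_new = m + r * step / np.linalg.norm(step)   # uniform direction, distance exactly r
    if loss(m_new) &lt; loss(m):
        m = m_new

print(m, loss(m))         # should end up near [3, -2]
</code></pre>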
<h3 id="policy-gradient">Policy gradient</h3>
<p>Follow some semi-random policy, wait until you reach a reward state, then label all previous state-action pairs with the final outcome.<br>
i.e. if some actions were bad, they will on average occur more often in sequences ending with a negative reward, and so will on average be labeled as bad more often.</p>
<p><img src="_resources/c484829362004f90be2b33a92acf7fd9.png" alt="442f7f9bc5e14ffbbcfd54f6ea6b72df.png"></p>
<p>\(\nabla \mathbb{E}_a[r(a)] = \nabla \sum_{a} p(a)\,r(a) = \sum_{a} p(a)\,r(a)\,\nabla \ln p(a) = \mathbb{E}_{a}[r(a)\,\nabla \ln p(a)]\) (using \(\nabla p(a) = p(a)\,\nabla \ln p(a)\)), where r is the ultimate reward at the end of the trajectory.</p>
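<p>A rough numpy sketch of this gradient estimate on a one-state "game" with three actions (the rewards, learning rate and the softmax-over-logits parameterisation are made up for illustration, not taken from the lecture):</p>
<pre class="hljs"><code># Policy gradient (REINFORCE-style): sample an action from the policy, observe the reward,
# and push the policy towards actions that ended with higher reward.
import numpy as np

rng = np.random.default_rng(0)
rewards = np.array([1.0, 5.0, 2.0])   # made-up reward per action, unknown to the agent
logits = np.zeros(3)                  # policy parameters
lr = 0.05

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(2000):
    probs = softmax(logits)
    a = rng.choice(3, p=probs)        # act according to the current (semi-random) policy
    r = rewards[a]                    # final outcome of this (one-step) trajectory
    grad_log_p = -probs               # gradient of ln p(a) w.r.t. the logits
    grad_log_p[a] += 1.0              # ... is onehot(a) - probs for a softmax policy
    logits += lr * r * grad_log_p     # ascend E_a[ r(a) * grad ln p(a) ]

print(softmax(logits))   # most of the probability mass should end up on action 1 (reward 5)
</code></pre>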
<h3 id="q-learning">Q-learning</h3>
<p>If I need this, I'll make better notes; I can't really understand it from the slides.</p>
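<p>For future reference, a minimal sketch of the standard tabular Q-learning update (this is the textbook rule, not something from these slides; the sizes, learning rate and discount factor are made up):</p>
<pre class="hljs"><code># Tabular Q-learning: learn Q(s, a), the expected future reward of taking action a in state s,
# from observed (state, action, reward, next state) transitions.
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9    # learning rate and discount factor (made up)

def q_update(s, a, r, s_next):
    """Move Q(s, a) towards r + gamma * max_a' Q(s_next, a')."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

q_update(0, 1, 1.0, 3)     # one observed transition: state 0, action 1, reward 1.0, next state 3
print(Q[0])                # Q(0, 1) has moved from 0.0 to 0.1
</code></pre>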
<h2 id="alpha-stuff">Alpha-stuff</h2>
<h3 id="alphago">AlphaGo</h3>
<p>Starts with imitation learning.<br>
It improves by playing against previous iterations and against itself, trained by reinforcement learning using policy gradients to update the weights.<br>
During play, it uses Monte Carlo Tree Search, with node values being the probability that black will win from that state.</p>
<h3 id="alphazero">AlphaZero</h3>
<p>Learns from scratch: there's no imitation learning or reward shaping.<br>
It's also applicable to other games, like chess.</p>
<p>Improves on AlphaGo by:</p>
<ul>
<li>combining the policy and value nets</li>
<li>viewing MCTS as a policy improvement operator</li>
<li>adding residual connections and batch normalization</li>
</ul>
<h3 id="alphastar">AlphaStar</h3>
<p>This one can play StarCraft.</p>
<p>StarCraft is real-time, with imperfect information, a large and diverse action space, and no single best strategy.<br>
AlphaStar's behaviour is generated by a deep NN that gets its input from the game interface and outputs instructions that form an action in the game.</p>
<p>It has a transformer torso for the units,<br>
a deep LSTM core with an autoregressive policy head, and a pointer network.<br>
It makes use of multi-agent learning.</p>
</div>
					</body>
				</html>